New perspectives on likelihood-based inference for latent and observed Gaussian mixture models
Monia Ranalli
2019
Abstract
Fisher’s influence on modern statistics is enormous: he was the first to turn statistical thinking from a small, obscure field into a systematic methodological framework (Fisher 1922). One of the most appealing features of Fisher’s philosophy is that it lends itself naturally to computational algorithms, which allows his methods to be extended to a much wider class of models through computing tools such as the EM algorithm and MCMC sampling. Nowadays, however, the complex structure of modern data raises new challenges for likelihood-based inference. A predominant class of models, namely latent variable models (Bartholomew et al. 2011), is the natural setting in which this complexity arises, and it is within this framework that the EM algorithm and simulation-based methods find their most direct application. Whatever the model, there always exists a hypothetical complete data set (X, H) that can be described by a probability model with a relatively simple structure. The key point, and at the same time the challenge, is that only the variables X are observed, while the variables H are hidden. Although the model for X alone may be computationally complex, the bigger picture that includes the hidden variables provides a simple framework for interpreting the model parameters. The hidden variables may be real (data are often missing naturally) or fictitious (created to help us tell a story about the data). This inference paradigm is in line with the simplification and idealization involved in any reasoning about a complex phenomenon: statistical inference attempts to deduce plausible models from a given set of observations.

One popular class of latent variable models is that of finite mixture models (see e.g. Lindsay 1995, McLachlan & Peel 2000). Finite mixture models have been used intensively in many fields and for different purposes, since they can capture many features of real data that deviate from the ideal assumption of normality, such as asymmetry, multimodality or extra variability. Their success is mainly due to the ease with which they can be fitted and interpreted, and they arise naturally whenever the assumption of homogeneity is not tenable. Their mathematical structure represents a population as a convex combination of a finite number of sub-populations, each described by its own density. Furthermore, from a clustering point of view, they provide a coherent strategy for classifying data while accounting for uncertainty through probabilities.

Despite these appealing features, inference for these models remains a challenge. The main problem addressed in this dissertation is the use of surrogate functions to approximate the likelihood in a mixture model framework. More precisely, surrogates are proposed to make likelihood inference possible in two different scenarios: high dimensionality, which makes inference computationally infeasible, and an unbounded likelihood, which makes inference impossible. Although the two problems differ in principle, both prevent likelihood inference, and the methods proposed here share the same idea: creating computationally simpler or mathematically bounded surrogate models. The assumptions behind a surrogate may be false or incomplete, but the resulting analysis is likely to be approximately valid if the underlying assumptions are approximately true.
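To fix notation, the convex combination just mentioned takes the standard textbook form (the symbols below are chosen here for illustration, not taken from the dissertation)

f(x; \psi) = \sum_{k=1}^{K} \pi_k \, f_k(x; \theta_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,

where f_k is the density of the k-th sub-population and \pi_k its mixing weight.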
In statistical terms, this means that all these methods aim at minimizing the Kullback-Leibler information. To be more precise, the unbounded likelihood is an anomaly peculiar to mixture likelihoods that occurs whatever the complexity of the model, whereas the intractable likelihood issue is a consequence of model complexity, i.e. it is strictly linked to dimensionality. In both cases, however, new perspectives on likelihood-based inference are developed through the use of surrogate functions.

This work includes three main contributions, presented in Chapters 3, 5 and 6, respectively. The first two chapters provide background for the rest of the dissertation. In the first, we briefly review the finite mixture model framework, exploring its mathematical structure and statistical properties. In the second, we introduce the concept of a surrogate function and describe some developments in likelihood-based inference for latent variable models; this chapter opens up the new perspectives on likelihood-based inference for mixture models argued in the sequel.

In Chapter 3 we use a latent Gaussian mixture model to cluster ordinal data. Full maximum likelihood estimation requires the evaluation of multidimensional integrals that cannot be computed in closed form. Unlike the existing literature (Everitt 1988, Lubke & Neale 2008), we consider a class of alternative estimation methods, since full maximum likelihood estimation becomes computationally infeasible as the number of observed variables increases (the high dimensionality issue). We adopt a pairwise likelihood method, belonging to the composite likelihood family, and show that it is a workable compromise between statistical and computational efficiency. Our motivation comes from existing results in the literature: composite likelihood methods are a flexible way to create consistent estimators that inherit the main desirable properties of maximum likelihood estimators, being asymptotically unbiased and normally distributed with variance given by the inverse of the Godambe information (Lindsay 1988, Varin et al. 2011). Moreover, they enjoy varying degrees of robustness (see Xu & Reid 2011), and under a certain closure property they are fully efficient and identical to the full maximum likelihood estimators in exponential families (Mardia et al. 2009). In general, efficiency is not easy to achieve and is strictly linked to the design of the composite likelihood. The key idea behind the first proposal is to use a latent Gaussian mixture to capture the latent cluster structure of the observed ordinal data. Estimating the probability of a specific response pattern (i.e. a sequence of categories, one for each observed variable) requires the computation of a multidimensional integral, and the complexity of the problem increases with the number of variables; thus a surrogate function is needed.
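To sketch the idea, with n observations on m ordinal variables the full log-likelihood involves m-dimensional integrals, whereas the pairwise log-likelihood involves only bivariate ones. In a generic formulation (the notation is ours, not the dissertation’s)

p\ell(\theta) = \sum_{i=1}^{n} \sum_{j < k} \log P(X_{j} = x_{ij}, X_{k} = x_{ik}; \theta),

where each bivariate probability is a two-dimensional integral of the underlying Gaussian mixture density over the rectangle of thresholds that discretizes the pair of latent variables.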
In Chapter 5 we extend this model to allow for noise variables or dimensions, i.e. those that carry no information about the clustering structure. In the literature, several techniques for simultaneous clustering and dimensionality reduction (SCR) have been proposed in a non-model-based framework for quantitative data (e.g. Vichi & Kiers 2001, Rocci et al. 2011) or categorical data (e.g. Van Buuren & Heiser 1989, Hwang et al. 2006). A model-based approach to SCR has been proposed by Kumar & Andreou (1998) for the analysis of continuous data.

We propose a novel model-based SCR method for ordinal data. The observed variables are treated as a discretization of underlying first-order latent continuous variables, so the full likelihood still has to be replaced by a simpler objective function to obtain the parameter estimates (again we use the pairwise likelihood). In this case, however, to detect noise dimensions the first-order latent variables are modelled as linear combinations of two independent sets of second-order latent variables, only one of which carries the information about the cluster structure, while the other contains the noise dimensions. Technically, the variables in the first set follow a finite mixture of Gaussians, while those in the second follow a multivariate normal distribution. If there are no noise variables or dimensions, the model reduces to the clustering model of Chapter 3. In both cases, parameter estimation is carried out through an EM-like algorithm, and a useful tool for classifying objects from the EM output, based on the iterative proportional fitting algorithm, is provided. To evaluate the effectiveness of the proposals, we present simulation studies, comparisons with existing models and approaches, and applications to real data.

Despite the attractive idea behind mixture models, namely accounting for heterogeneity and investigating the latent cluster structure, we should be aware of some limitations. In Chapter 6 we deal with a common likelihood anomaly occurring in the heteroscedastic Gaussian mixture context (Kiefer & Wolfowitz 1956). Here the data are continuous and assumed to follow a mixture model. Inference can be very difficult, since the likelihood is unbounded, leading to a non-finite maximum likelihood estimator (the unbounded likelihood issue). In the univariate case this happens when one component mean equals a sample observation and the corresponding variance approaches zero. In the multivariate case the likelihood tends to infinity when the covariance matrix of one (or more) components becomes singular and the corresponding mean coincides with an observation along the direction of the eigenvector associated with the vanishing eigenvalue. Different solutions have been proposed in the literature (e.g. Policello II 1981, Ingrassia & Rocci 2011). It is known that there exists a sequence of roots of the likelihood equation that is consistent and asymptotically efficient (Kiefer 1978, Peters & Walker 1978). Nevertheless, multiple local maxima can exist for a given sample; hence the other major difficulty of maximum likelihood is determining when the correct root has been found (see e.g. Hathaway (1985) and references therein).
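The univariate degeneracy described above is easy to reproduce numerically. The following minimal sketch (ours, not taken from the dissertation; it assumes only numpy and scipy) fixes one component mean at the first observation and shrinks its standard deviation, so that the mixture log-likelihood grows without bound:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=50)  # a sample from a single standard normal

def mixture_loglik(x, p, mu1, sd1, mu2, sd2):
    """Log-likelihood of a two-component univariate Gaussian mixture."""
    dens = p * norm.pdf(x, mu1, sd1) + (1 - p) * norm.pdf(x, mu2, sd2)
    return np.sum(np.log(dens))

# Degenerate path: mu1 is fixed at the first observation while sd1 -> 0;
# the log-likelihood diverges roughly like -log(sd1).
for sd1 in (1.0, 0.1, 0.01, 1e-4, 1e-8):
    print(f"sd1 = {sd1:8.0e}  loglik = {mixture_loglik(x, 0.5, x[0], sd1, 0.0, 1.0):10.2f}")
```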
To overcome this anomaly, in Chapter 6 we propose a marginal likelihood approach based on invariant transformations. This requires the use of a Monte Carlo (MC) likelihood (Geyer & Thompson 1992, Geyer 1994). The intuition underlying the MC likelihood is to make everything ideally observable: it simplifies the inferential problem by simulating the hidden variables directly. The resulting estimators are still asymptotically unbiased and normally distributed, with variance given by the inverse of the Godambe information (see e.g. Sung & Geyer 2007), but compared to composite likelihood methods the MC likelihood has a further attractive feature: the missing information can be quantified by the sample variance of the complete-data score function, so the information loss can be controlled by increasing the number of simulations (thereby decreasing the score variance). The problem becomes very challenging when the missing information exceeds the observed information. The main drawback, however, lies in the simulation step: the choice of the importance density has a strong influence on the behaviour of the Monte Carlo maximum likelihood estimator (MCMLE; see Sung & Geyer 2007, Billio et al. 1998). The MC likelihood is a powerful and promising method, provided the simulations are done right. Ideally we would sample i.i.d. from fθ(H | X); in practice this can be carried out through MCMC (in particular Gibbs sampling), importance sampling (IS), or hybrid strategies combining MCMC with IS. Such hybrids address two problems, sparseness and shrinkage, that occur when the Gibbs sampler and IS are used alone. To resolve the unbounded likelihood problem, we therefore investigate the use of location invariant and location-scale invariant likelihoods combined with the MC likelihood, the latter obtained through a hybrid strategy with a novel importance sampling scheme (the basic importance-sampling idea is sketched at the end of this abstract). We explore the effectiveness of this hybrid strategy, which combines the strengths of Gibbs and importance sampling, focus on the inference underlying the proposed approach, and introduce a new way to estimate the Godambe information.

The present work leaves some open issues that motivate further research, and avenues for future development are pointed out throughout the exposition. We can nevertheless conclude that the approaches adopted here differ in principle but share the same background. Their junction point is the likelihood, and there are many ways in which creative thinking about the interaction between theory, methodology and applications can move us forward.
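As a closing illustration of the Monte Carlo likelihood idea, here is a minimal importance-sampling sketch (ours, with a hypothetical complete-data model; the hybrid Gibbs/IS scheme of Chapter 6 is more elaborate). The marginal likelihood f(x; θ) = ∫ f(x, h; θ) dh is estimated by averaging the complete-data density, weighted by an importance density g, over simulated draws of the hidden variable:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def complete_data_density(x, h, theta):
    """Hypothetical complete-data model: h ~ N(theta, 1), x | h ~ N(h, 1)."""
    return norm.pdf(h, loc=theta, scale=1.0) * norm.pdf(x, loc=h, scale=1.0)

def mc_loglik(x, theta, n_sim=100_000):
    """Importance-sampling estimate of log f(x; theta).

    g is a standard normal here; as noted above, its choice strongly
    influences the behaviour of the MCMLE.
    """
    h = rng.normal(size=n_sim)                                   # draws from g
    weights = complete_data_density(x, h, theta) / norm.pdf(h)   # f(x, h) / g(h)
    return np.log(np.mean(weights))

# Under this toy model the exact marginal is x ~ N(theta, 2),
# so the MC estimate can be checked against the closed form.
x_obs, theta = 0.7, 0.5
print(mc_loglik(x_obs, theta))                       # approx. -1.275
print(norm.logpdf(x_obs, loc=theta, scale=2**0.5))   # exact value
```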